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Abstract 

The Nystrom method is routinely used for out-of-sample extension of kernel ma- 
trices. We describe how this method can be applied to find the singular value de- 
composition (SVD) of general matrices and the eigenvalue decomposition (EVD) of 
square matrices. We take as an input a matrix M G M. mxn , a user defined integer 
s < min(m,n) and Am £ M sxs , a matrix sampled from the columns and rows of 
M. These are used to construct an approximate rank-s SVD of M in O (s 2 (m + n)) 
operations. If M is square, the rank-s EVD can be similarly constructed in O (s 2 n) 
operations. Thus, the matrix Am is a compressed version of M. We discuss the choice 
of Am and propose an algorithm that selects a good initial sample for a pivoted version 
of M. The proposed algorithm performs well for general matrices and kernel matrices 
whose spectra exhibit fast decay. 

Keywords: {Compression, SVD, EVD, Nystrom, out-of-sample extension} 

1 Introduction 



Low rank approximation of linear operators is an important problem in the areas of scientific 
computing and statistical analysis. Approximation reduces storage requirements for large 
datasets and improves the runtime complexity of algorithms operating on the matrix. When 
the matrix contains affinities between elements, low rank approximation can be used to 
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reduce the dimension of the original problem Q2SJ H2J [23]) and to eliminate statistical noise 



Our approach involves the choice of a small sub-sample from the matrix, followed by the 
application of the Nystrom method for out-of-sample extension. The Nystrom method ([2]), 
which originates from the field of integral equations, is a way of discretizing an integral 
equation using a simple quadrature rule. When given an eigenfunction problem of the form 



the Nystrom method employs a set of s sample points y±, . . . , y s that approximate f(x) as 



In recent years, the Nystrom method has gained widespread use in the field of spectral 
clustering. It was first popularized by [32J for sparsifying kernel matrices by approximating 
their entries. The matrix completion approach of [T7] also enables the approximation of 
eigenvectors. It was now possible to use the Nystrom method in order to speed up algorithms 
that require the spectrum of a kernel matrix. Over time, Nystrom based out-of-sample 
extensions have been developed for a wide range of spectral methods, including Normalized- 
Cut ([IH1E]), Geometric Harmonics ([H]) and others ([BJ). 

Other noteworthy methods for speeding up kernel based algorithms, which are not appli- 
cable to the proposed setting of this paper, are based on sampling pQ, convex optimization 
[9] and integral equations. ACA [31 0] is an important example in the latter category. ACA 
can be regarded as an efficient replacement of the SVD which is tailored to asymptotically 
smooth kernels. The kernel function itself is not required. ACA uses only few of the original 
entries for the approximation of the whole matrix and it was shown to have exponential 
convergence when used as part of the Nystrom method. 

In this paper, we present two extensions of the matrix completion approach of [17] . These 
allow us to form the SVD and EVD of a general matrix through the application of the 
Nystrom method on a previously chosen sample. 

In addition, we present a novel algorithm for selecting the initial sample to be used with the 
Nystrom method. Our algorithm is applicable to general matrices whereas previous methods 
focused on kernel matrices. The algorithm uses a pre-existing low-rank decomposition of the 
input matrix. We show that our sample choice reduces the Nystrom approximation error. 

The paper is organized as follows: Section [2] describes the basic Nystrom matrix form 
and the methods of [T7J for finding the EVD of a Nystrom approximated symmetric matrix. 



([22]). 




M(x,y)f(y) dy, 



Xf(x)^—J2M(x,y j )f(y j ). 
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Section [3] outlines a Nystrom-like method for out-of-sample extension of general matrices, 
starting with the SVD of a sample matrix. In section [4] we describe procedures that explicitly 
generate the canonical SVD and EVD forms for general matrices. Section [5] introduces the 
problem of sample choice and presents results that bound the accuracy of the algorithm in 
section [6] Section [6] presents our sample selection algorithm and analyzes its complexity. 
Experimental results on general and kernel matrices are presented in section [7} 

2 Preliminaries 



2.1 Square Nystrom Matrix Form 

Let M G M. nxn be a square matrix. We assume that the M can be decomposed as 



M 



Fm 



Bm 
Cm 



(1) 



where A M G W XS ,B M G M sx(n - s) ,F M G R^-^ 8 and C M G R^M™-*). The matrix A M is 

designated to be our sample matrix. The size of our sample is s, which is the size of Am- 

Let UAU -1 be the eigen-decomposition of Am, where U G M sxs is the eigenvectors matrix 

and A G M. sxs is the eigenvalues matrix. Let u l G M s be the column eigenvector belonging 

to eigenvalue Aj. We aim to extend the column eigenvector (the discrete form of an eigen- 

r 1 T 

function) to the rest of M. Let u % = u % u L G 1R™ be the extended eigenvector, where 
u l g M. n ~ s is the extended part. By applying the Nystrom method to u l , we get the following 
form for the k th coordinate in u l : 



(2) 



3=1 



By setting [a, b] = [0, 1] and presenting Eq. ^ in matrix product form we obtain 

1 



XiU 1 



-F M ■ u\ 



(3) 



u 1 



u 



This can be done for all the eigenvalues {Aj}| =1 of A M - Denote U = 
j^(n-s)xs gy pi ac j n g a }} expressions of the form Eq. ^ side by side we have UA = FmU. 
Assuming the matrix Am has non-zero eigenvalues (we return to this assumption in section 



5.4), we obtain: 



U = FmU A 



(4) 
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Analogically, we can derive a matrix representation for extending the left eigenvectors of M, 
denoted as V G R sxn - S : 

V = A- 1 U- 1 B M . (5) 

Combining Eqs. Q and ([5]) with the eigenvectors of Am yields the full left and right 
approximated eigenvectors: 



U 



u 

FmUA- 1 



V 



U- 1 A^U^B 



M 



(6) 



The explicit "Nystrom" representation of M becomes: 

U 



M = UAV 



FmUA- 1 



A 



U- 1 A^U^B 



M 



A 



M 



B 



M 



Fm F m A m B m 



A m 
Fm 



(7) 



.4 



M 



Am Bm 



where A M denotes the pseudo-inverse of Am- 

Equation Q shows that the Nystrom extension does not modify Am, Bm and Fm, and 
that it approximates Cm by F M A M B M - 

2.2 Decomposition of Symmetric Matrices 

The algorithm given in |il7] is a commonly used method for SVD approximation of symmetric 
matrices. For a given matrix, it computes the SVD of its Nystrom approximated form. The 
SVD and EVD of a symmetric matrix coincide up to the signs of the singular (eigen-) values. 
Therefore the SVD can approximate both simultaneously. We describe the method of p2] 
in section 12.2.21 



2.2.1 Symmetric Nystrom Matrix Form 

When M is symmetric, the matrix M has the decomposition 



M 



Am Bm 
F> M Cm 



where A M e R SXS ,B M G M sx ( n - S ) and C M G R^M™-*). We replace F M in Eq. Q with 



B T M . 
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By using reasoning similar to section 2.1, we can express the right and left approximated 
eigenvectors as: 

U 



U 



V 



U- 1 A^U^B 



M 



The explicit "Nystrom" representation of M becomes: 



M = UAV 



U 

B T M UA-i 



A 



A 



U- 1 A- l U~ l B M 



M 



A 



M 



A-m B M 



A 



M 



B 



M 



b m B t m A m B m 



(9) 



(10) 



2.2.2 Construction of SVD for Symmetric M 

Our goal is to find the s leading eigenvalues and eigenvectors of M without explicitly forming 
the entire matrix. 

We begin with the decomposition of M as in Eq. ([8]). The approximation technique 
in [T7] uses the standard Nystrom method in Eq. ^ to obtain U . Then, the algorithm 
forms the matrix Z = UA 1 / 2 such that M = ZZ T = UAU T . The symmetric s x s matrix 
Z T Z is diagonalized as FTiF T . The eigenvectors of M are given by U = ZFH~ 1 / 2 and the 
eigenvalues are given by S. To qualify for use in the SVD, U and S must meet the following 
requirements: 

1. The columns of U must be orthogonal. Namely, UjU = I. 

2. The SVD form of U Q and £ must form M. Formally, M = U TjUJ . 

The following identities can be readily verified using our expressions for U and S: 

1. Bi-orthogonality: UjU a = T,- l l 2 F T Z T ZFT,- 1 ' 2 = S" 1 / 2 ^ (FSF T ) FY,' 1 ' 2 = /; 

2. SVD form: U EUj = ZF^ 2 ■ S ■ ^- 1 / 2 F T Z T = ZZ T = M. 

The computational complexity of the algorithm is O (s 2 n), where s is the sample size and 
n is the number of rows and columns of M. The bottleneck is in the computation of the 
matrix product Z T Z. 
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2.2.3 A Single-Step Solution for the SVD of M 

The "one-shot" solution in [T7j assumes that Am has a square root matrix A M 2 . This 
assumption is true if the matrix is positive definite. Otherwise, it imposes some limitations 
on A M . These will be discussed in section |4~3 



.4 



M 



Let A^ 2 be the pseudo-inverse of the square root matrix of Am- Denote G T = A^ 

From this definition we have M = GG T . The matrix S G R sxs was defined in [17] . where 
S = G T G = A M + A M X ^ 2 B M B M A~^ 2 . S is fully decomposed as UsA s Uj. The orthogonal 

— 1/2 

eigenvectors of M are formed as U Q = GUsA s and the eigenvalues are given in As- 



The following required identities, as in section 2.2.2 can again be verified as follows: 



1. Bi-orthogonality: 

U?U = A S 1/2 UJG T GU S A S 1/2 = A S 1/2 UJSU S A S 1/2 = A s 1/2 Uj ■ U s A s Uj ■ U S A S 1/2 = 
I. 

2. SVD form: U a A s Uj = GU S A S 1/2 ■ A s ■ A s 1/2 UjG T = GG T = M. 

The computational complexity remains the same (the bottleneck of the algorithm is the 
formation of BmB m ). However this version is numerically more accurate. According to pT 



the extra calculations in the general method of solution lead to an increase in the loss of 
significant digits. 



3 Nystrom-like SVD approximation 

The SVD of a matrix can also be approximated via the basic quadrature technique of the 
Nystrom method. In this case, we do not require an eigen-decomposition. Therefore, M 
does not necessarily have to be square. Let M G ]R mxn be a matrix with the decomposition 
given in Eq. Q. We begin with the SVD form Am = UAH where U, H G IR SXS are unitary 
matrices and A G IR SXS is diagonal. We assume that zero is not a singular value of Am- 
Accordingly, U can be formulated as: 

U = A M HA-\ (11) 
Let u l , h % G M s be the i th columns in U and H, respectively. Let u l = {u\} s l=1 be the 



partition of u l into elements. By using Eq. (11), each element u\ can be presented as the 
sum u\= £ £"=i My * 



We can use the entries of Fm as interpolation weights for extending the singular vector 

s+l 



u % to the k th row of M, where s + 1 < k < n. Let u % = {u % k _ s }1 =s+1 G K" s be a column 
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vector that contains all the approximated entries. Each element u\_ s will be calculated as 



k—s 



~t X]j=i ■ h l j. Therefore, the matrix form of u % becomes u % = >/ 



Putting together all the u l 's as U 



u 2 



u 



-1 



l"- sxs , we get U = F M HA- 
The basic SVD equation of Am can also be written as H = A M UA~ l . We approximate 
the right singular vectors of the out-of-sample columns by employing a symmetric argument. 
We obtain H = B^UA' 1 . 

The full approximations of the left and right singular vectors of M, denoted by U and H, 
respectively, are 

TT \ U 

(12) 





u 




H 


u = 


, H = 


_ B T M UA-t _ 


_ FmHA- 1 _ 





The explicit "Nystrom" form of M becomes 
M = UAH T = 



U 

FmHA- 1 



A 

A M 
Fm 



H T A~ l U T B 



M 



Am Bm 
Fm FmA m Bm 



(13) 



.4 



M 



A M B 



M 



where A M denotes the pseudo-inverse of Am- M does not modify Am,Bm and Fm but 
approximates Cm by FmA m Bm- Note that the Nystrom matrix form of the SVD is similar 
to Eq. ([7]), which is the Nystrom form of the EVD matrix. 



4 Decomposition of General Matrices 

We will refer to a decomposition of M given in Eq. Q with the corresponding decomposition 
into Am, Bm, Fm and Cm- M denotes the approximated Nystrom matrix. 

This section presents procedures for explicit orthogonalization of the singular-vectors and 
eigenvectors of M. Starting with M in the form of Eqs. ^ and (13), we find its canonical 
SVD and EVD form, respectively. Constructing these representations takes time and space 
that are linear in the dimensions of M. 



4.1 Construction of EVD for M 

Let M be a square matrix. We will approximate the eigenvalue decomposition of M without 
explicitly forming M. 

We begin with a matrix M that is partitioned as in Eq. 0. By explicitly employing 
the Nystrom method, we construct U and V as defined in Eq. ([6]). Then, we proceed by 
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defining the matrices Gjj = UA 1 ^ 2 and Gy = A 1//2 V '. We directly compute the EVD of 
G v Gu as FEF- 1 . The eigenvalues of M are given by E and the right and left eigenvectors 
are U D = Gt/FE~ 1/2 and V a = E^F^Gy, respectively 

The left and right eigenvectors are mutually orthogonal since 

V a U = rr 1/2 F -1 Gy ■ GuFY,- 1 ' 2 = Y,- 1/2 F- 1 ■ FEF" 1 • F£" 1/2 = I. 

The EVD form of U ,V and E gives M, as we see from 

UoEV = GuFT,- 1 ' 2 ■ E • E -1 / 2 F _1 Gy = Gf/Gy = ?7A 1/2 • A 1/2 y = M. 

These two properties qualify U T,V as the EVD of M. 

When M is symmetric, the matrix Gy is simply Gjj. By using the terminology in section 
2.2.2, we denote Gy = Z and the matrix GyGu is transformed into ZZ T . From here on the 



method of solution in section 2.2.2 coincides with the current section. Hence, this form of 
EVD approximation generalizes the symmetric case. 

The computational complexity is 0(s 2 n), where s is the sample size (the size of Am) and 
n is the size of M. The computational bottleneck is in the formation of GyGu- 



4.1.1 A Single-Step Solution for the EVD for M 



i/2 

This solution method assumes that Am has a square root matrix A M . From this assumption 



we can modify the algorithm in section 4.1 to construct the EVD of M with fewer steps. 
We define the matrices Gjj and Gy to be 



G v 



A M 
Fm 



A 



-1/2 
M i 



Gx 



A 



-1/2 
M 



Am B 



M 



We proceed to explicitly compute the eigen-decomposition of GyGu £ M sxs as GyGu = 
FEF -1 . The eigenvalues of M are given by E and the right and left eigenvectors of M are 
formed by U = Gf/FE" 1 / 2 and V a = E -1 / 2 F _1 Gy, respectively. A gain, we can verify the 
eigenvectors are mutually orthogonal: 

V U = E _1 / 2 F _1 GyG C /FE -1 / 2 = Y,~ l l 2 F~ l -GyGu-FY.^ 1 ! 2 = Z^F^-F^F^-F^- 1 / 2 = I, 



and the matrices U , V and As form M as 
U A s Vo = GuFY,- 1 ' 2 ■ E • Tr 1/2 F~ l Gy = G V G V 



A M 
Fm 



.-1/2.-1/2 

M M 



Am Bm 



M. 
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The reduction to the symmetric case is straightforward here as well. We have Gy = G T 



u 



when M is symmetric. By using the terms of section 2.2.3, we have Gjj = Gy = G. The 
expression GyGy turns into G T G. After that point the methods of solution coincide. 
Again, the algorithm takes 0(s 2 n) operations due to the need to calculate GyGu- Com- 



pared to the solution given in section 4.1, the single-step solution performs fewer matrix 



operations. Therefore, it achieves better numerical accuracy. 
4.2 Construction of SVD for M 

Let M be a general mxn matrix with the decomposition in Eq. ([I]). Given an initial sample 
A M , we present an algorithm that efficiently computes the SVD of M (defined by Eq. ([7])). 
We explicitly compute the SVD of Am and use the technique outlined in section [3] to 



obtain U and H as in Eq. (12). We form the matrices Zy = UA 1 / 2 and Zh = His}/ 2 . 
We proceed by forming the symmetric s x s matrices ZjjZu and Z h Zh and compute their 
SVD as ZjjZu = F U T, U F^ and Z H Z H = F H T> H F]j, respectively. The next stage derives an 
SVD form for the s x s matrix D = S^/ 2 'Fy 'FhT}Jj 2 . This is given explicitly by computing 
D = UdAdHq. The singular values of M are given in Ad and the leading left and right 
singular vectors of M are U Q = ZuFxjYj^ Ud and H Q = Z H F H Y> H H D , respectively. The 
columns of U Q and H Q are orthogonal since 

UjU = UpYijj 1 2 F^ Zrfj ■ ZxjFjjYiy 2 U D = U^Ej/^F^ ■ Fu'EuF^ ■ F u H u l 2 U D = UpU D = I, 

HjH = HqY,h 7 'F]jZ H -Z H F H T, I ^^ 2 H D = HpYip 1 2 F H '-F H Y, H F]j-F H 'E H 1 2 H D = H^H D = I. 
The SVD of M is formed by using U , H Q and Vd 

U Ad HJ = ZuFijYj^^XJd ■ Ad ■ H D ^L H 1 ^ 2 FjjZ H = ZuFuEy 1 ^ 2 ■ D ■ F^Zfj = 
= ZuFuE- 112 ■ ^FjjFu^i 2 ■ YT^ 2 FlZl = Z V Z T H = UA^ 2 ■ A^H T = M. 



When M is symmetric, this solution method coincides with the method in section 2.2.2 



The matrices Zu and Zh correspond to Z in section 2.2.2 The matrix D becomes the 
diagonal matrix E of the symmetric case. The computational complexity of the procedure 
is O (s 2 (m + n)). The bottleneck is the computation of ZjjZu and ZJjZh- 

4.2.1 A Single-Step Solution for the SVD of M 

A 1/2 

This solution method assumes that Am has a square root matrix A M . Similar to section 



4.1.1, this assumption allows us to modify the algorithm of the general case to achieve the 



same result in fewer steps. 
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— 1/2 

Let A M be the pseudo-inverse of the square root matrix of Am- We begin by forming 
the matrices Gy and Gh such that 



G 



v 



A M 
Fm 



A 



-1/2 
M 



Am B 



M 



The symmetric matrices GjjGjj and G h Gh are diagonalized by GjjGjj = FuEuFjj and 
G h Gh = Fh^hFJj. From these parts we form D = Y>)j 2 F^ FhE 1 ^ 2 which is explicitly 
diagonalized as D = UdAqH^. The singular values of M are given by A D and the left 
and right singular vectors are given by U a = GuF u T,^ 7 1 ' 2 Ud and H a = GhFhY,^ 1 ^' 'H d , 
respectively. 



As in section 4J2 we can verify the identities that make this decomposition a valid SVD. 
The singular vectors are orthogonal: 

UjU = UJjTjjj 1 ^ 2 'FyGjj ■ GjjF u 'E u 1 / 2 Ud = U'£'£ U 1 ^ 2 F^ ■ FjjUijF^ ■ F u T, u 1 ^ 2 Ud = U^Ud = I, 

hJh q = H^ ) 'E h 1 ^ 2 FJ i GJ i -G h F h T ih 1 ^ 2 H d = HJ ) 11^ 2 F]j-F h Y1 h FJ i -F h T 1 ^ 2 H D = Hj)H D = I. 
The SVD is formed by U a , H Q and A D : 

U A D HJ = GuFuE^Un ■ A D ■ H^ H 1/2 F^G T H = G^E" 172 - " - ^ 1/2 



■D.YT H ^FlG\ 



,1/2 



-1/2 T?Tr<T 
H r H^H 



GjjGjj 



A M 
Fm 



.-1/2 .-1/2 
M M 



Am B m 

If M is symmetric, this method reduces to the single-step solution described in section 



2.2.3 The matrices Gu and Gh correspond to G in the symmetric case. The matrix D 
becomes As. 

The computational complexity of the procedure remains 0(s 2 (m + n)). The computa- 
tional bottleneck of the algorithm is in the formation of Gj^Gu. 



4.3 Prerequisite for the Single-Step method 



The single-step methods, described in sections [2.2.3[ |4.1.1| and |4.2.1[ require that Am have 
a square root matrix. 

When a matrix is positive semi-definite, a square root can be found via the Cholesky 
factorization algorithm ([19] chapter 4.2.3). But positive-definiteness is not a necessary 
prerequisite. For example, the square root of a diagonalizable matrix can be found via its 
diagonalization. If A M = UAU' 1 , then, A 1 / 2 = UA^U' 1 . In this case, the matrix does not 
need to be invertible. 
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It can be shown that under a complex realm, every non-singular matrix has a square root. 
An algorithm for calculating the square root for a given non-singular matrix is given in [7]. 
This suggests a way of assuring the existence of a square root matrix. We can make Am 
non-singular, or equivalently, a full rank matrix. 

The rank of Am will also have a role in bounding the approximation error of the Nystrom 



procedure. This will be elaborated in section 5.4 



5 Choice of Sub-Sample 

The choice of initial sample for performing the Nystrom extension is an important part in 
the approximation procedure. The sample matrix Am is determined by permutation of the 
rows and columns of M (as given in Eq. 0). Our goal is to choose a (possibly constrained) 
permutation of M such that the resulting matrix can be approximated more accurately by 
the Nystrom method. Here accuracy is measured by L2 distance between the pivoted version 



of M and the Nystrom approximated version. This notion is made precise in section pA 

We allow for complete pivoting in the choice of a permutation for M. This means that 
both columns and rows can be independently permuted. This kind of pivoting does not 
generally preserve the eigenvalues and eigenvectors of the matrix. However, the singular 
values of the matrix remain unchanged and the singular vectors are permuted. Formally, let 
E r and E c be the row and column permutation matrices, respectively. Using the SVD of 
M, the pivoted matrix is decomposed as E r ME c = E r UT>V T E c = (E r U) £ (V T E^. Row 
and column permutations leave U and V T unitary. Therefore (E r U) £ (V T E C ^ is the SVD 
of E r ME c . The singular vectors of M can be easily regenerated by permuting the left and 
right singular vectors of E r ME c by E^ 1 and E^ 1 respectively. 



Section 5.4 shows the choice of Am determines the Nystrom approximation error. Hence, 
the problem of choosing a sample is equivalent to choosing the rows and columns of M whose 
intersection forms Am- Therefore, it makes sense to use the size s of Am as our sample size. 
This size largely determines the time and space complexity of the presented approximation 
procedures. The complexities are O (s 2 (m + n)) and O (s(m + n)), respectively. 



5.1 Related Work on Sub-Sample Selection 

Previous works on sub-sample selection focused on kernel matrices. These were done for 
symmetric matrices where the entries represent affinities. In these settings, we can use a 
single permutation for the columns and rows without changing the original meaning of the 
matrix. This pivoting variant is called symmetric pivoting. Sample selection algorithms for 
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kernel matrices try to find a permutation matrix E p such that E^ME p is most accurately 
approximated by the Nystrom method. 

The simplest sample selection method is based on random sampling. It works well for 
dense image data Q17J). Random sampling is also used in [30J while employing a greedy 
criterion that helps to determine the quality of the sample. A different greedy approach 
for sample selection is used in |25j, where a new point is added to the sample based on its 
distance from a constrained linear combination of previously selected points. 

In [33], the fc-means clustering algorithm is used for selecting the sub-sample. The fc-means 
cluster centers are shown to minimize an error criterion related to the Nystrom approxima- 
tion error. Finally, Incomplete Cholesky Decomposition (ICD) ([IS]) employs the pivoted 
Choleksy algorithm and uses a greedy stopping criterion to determine the required sample 
size for a given approximation accuracy. 

The Cholesky decomposition of a matrix factors it into Z T Z, where Z is an upper trian- 
gular matrix. Initially, Z = 0. The ICD algorithm applies the Cholesky decomposition to 
M while symmetrically pivoting the columns and rows of M according to a greedy criterion. 
The algorithm has an outer loop that scans the columns of M according to a pivoting order. 
The results for each column determine the next column to scan. This loop is terminated 
early after s columns were scanned by using a heuristic on the trace of the residual Z T Z — M. 
This algorithm Q16J) approximates M. This is equivalent to a Nystrom approximation where 
the initial sample is taken as the intersection of the pivoted columns and rows. 

When M is a Gram matrix, it can be expressed as the product of two matrices. Let 
M be decomposed into M = X T X where X G ]R nxn . The special properties of M were 
exploited differently in [15J. Specifically, the fact that Ma is the norm of the column Xi is 
used. A non-Gram matrix requires 0(n 2 ) additional operations to compute XjXi, which is 
impractical for large matrices. Once the norms of the columns in X are known, a method 
similar to [H] is used to choose a good column sample from X. The intersection in M of 
the pivoted columns and the corresponding rows is a good choice for Am- The Nystrom 



procedure is then performed similarly to what was described in section 2.2.2 The runtime 
complexity of the algorithm in [15] is 0(n). 



5.2 Preliminaries 

Definition 5.1. Approximate 'thin' Matrix Decomposition. Given a matrix M G 
R rax ° A "thin" matrix decomposition is an approximation of the form M = GS where 
G G R mxk , S G R kxn and k < min(m,n). 

This form effectively approximates M using a rank-fc matrix product. A good example 



12 



for such an approximation is the truncated rank-fc SVD. It approximates am x n matrix as 
UAV T , where U G M mxfe ,A G R kxk and V G R nxk . When this decomposition is employed, 
we can choose, for example, G = U, S = AV T . Many algorithms ([HI [131 ED ES]) exist for 
approximating the rank-/c SVD with a runtime close to 0(mn). 

Truncated SVD is a popular choice, but it is by no means the only one. Other examples 
include truncated pivoted QR ([SI]) or the interpolative decomposition (ID) as outlined in 

Definition 5.2. Numerical Rank. A matrix A has numerical rank r with respect to a 
threshold e if a r+ i{A) is the first singular value such that 

a r+1 {A) 

This definition generalizes the L 2 condition number {k 2 {A)), since it also applies to non- 
invertible and non-square matrices. 

Definition 5.3. Rank Revealing QR Decomposition (RRQR). Let A G R mxn fr e 

a matrix and let k be a user defined threshold. A RRQR algorithm finds a permutation 
matrix E such that AE has a QR decomposition with special properties. Formally, we write 
AE = QR such that Q is an orthogonal matrix and R is upper triangular. Let R have the 
following decomposition: 

Rn R\2 
R 22 

where R n G R kxk ,R 12 G M fcx ( n - fc ) and R 22 G R(™- fe )x("- fe ). i et p (k,n) be a fixed non- 
negative function bounded by a low degree polynomial in k and n. A RRQR algorithm tries 
to permute the columns of A such that 

o-k (Rn) > °x (R22) < (J k+l {A) ■ p{k, n). 

p{K, n) 

An overview on this topic is given in [20J. 

The relation between A and R can shed some light on the rank-revealing properties of 
RRQR. Let AE = A\ A 2 be a partitioning of AE such that A\ contains the first k 
columns. The RRQR decomposition is rank-revealing in the sense that it tries to put a set 
of k maximally independent columns of A into A\. We formalize this statement with Lemma 

El 

Lemma 5.4. Assume that the RRQR algorithm found a pivoting of A such that a k {Rn) > 
(jfc {A)//3, where > 1. If A has numerical rank of at least k with respect to the threshold e, 



R 



(14) 



13 



then, the numerical rank of Ax (the first k columns of AE) is k with respect to the threshold 
P-e. 

l T 



Proof. The RRQR algorithm yields A\ = Q 



Ru 



. Since Q is orthogonal, it does 



not modify singular values. Therefore, we have o k (Ai) = a k 



Ru o 



combining the above with our assumption on the RRQR algorithm, we get 

P ■ «k (A) > o k (A) . 
The interlacing property of singular values (Corollary 8.6.3 in [19]) gives us 

a x (A) > a x (A,) . 



a k (Ru). By 
(15) 



(16) 



By employing definition 5.2 for A and incorporating Eqs. (|15|) and (16), we get 



a k (A) ~ a k (A) ~ p-a k {A 1 )' 



By rearranging terms, we get 



ox (A 



Ok (Ad 

Therefore the numerical rank of Ax is at least k with respect to the threshold /3 ■ e. Since Ax 
has only k columns, it has precisely this rank. □ 



5.3 Algorithm Description and Rationale 

Initially, our algorithm decomposes the matrix M into G ■ S. Then, a RRQR algorithm 
chooses the s most non-singular columns of G T and S and insert then into G T A and Sa, 
respectively. We use a variant of RRQR that measures non-singularity according to the 



magnitude of the last singular value (see the proof of Corollary 5.8). The non-singularity of 



Ga and Sa will bound the non-singularity of GaSa (see Eq. (20)). 

On a higher level observation, the algorithm will try to perform an exhaustive search for 
the s x s most non-singular square in GS. However, since GS approximates M, choosing 
Am from the same rows and columns of M amounts to choosing one of its most non-singular 



squares. These notions are formalized in Theorem |5.6 

The magnitude of the last singular- value in A M , denoted by a s (A M ), will be used as a 
measure for the singularity of Am- This quantity is instrumental in defining the bound of 



the approximation error given in Theorem 5.12 We show in the experimental results section 
(section [7]) that empirically, cr s (Am) is strongly related to the approximation error of the 
Nystrom procedure. 
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5.4 Analysis of Nystrom Error 

Let M be a matrix with the decomposition given by Eq. (IT]). This partitioning corresponds 

to sampling s columns and rows from M to form the matrix Am- Our error analysis depends 

on an approximate decomposition of M into a product of two 'thin' matrices. Let M ~ GS 

be a decomposition of M where G G ]R mxs and S G IR sxn . The approximation error of M 

r 1 T 

by GS is denoted by e s . Formally, ||M — GS\\ 2 < e s . Let G = Ga Gb be a row 
partitioning of G where G A G R sxr and G B G R( m - S ) xr . Let S = [ S A S B } be a column 
partitioning of 5 where G IR rxs ,>SB G ]R rx ( ?1_s ). This notation yields the following forms 
for the sub-matrices of M: 

A M ~G A S A , B m ~G a S b , F m ~GbS a , C m ~G b S b . (17) 

where Am, Bm, Fm and Cm were defined in Eq. [Tj 

Lemma 5.5. (based on Corollary 8.6.2 in [19]) If A and A + E are in W nxn then for 
k < min (m, n) we have {A + E) — (A) | < <j\ (E) = \ \E\\ 2 . 

Proof. Corollary 8.6.2 in [19] states the same lemma with the requirement m > n. If m < n, 
we can use the original version of the lemma to get | cr^. (A T + i? T ) — a\. (^4 T ) | < | \ F T \ \ 2 - 
Transposition neither modifies the singular values nor the norm of a matrix. □ 

Theorem 5.6. Assuming that 

1. a s (GS) > 0; 

(This means that GS is of rank at least s. Otherwise, a non-singular Am cannot be 
found) 

2. a s (G) a s (S) = a s (GS) /j for some constant 7 > 1; 

(It will allow us to use the non- singularity of Ga and Sa as a bound for the non- 
singularity of GaSa- This demands the initial decomposition to be reasonably well 
conditioned. See Corollary 5. 7| for details) 

3. ct s (Ga) > c s (G)/ (3 and o~ s (Sa) > o~ s (S)/(3 for some constant {3 > 1; 

(This will allow us to use a s (Am) as a bound for a s (GS). The RRQR algorithm will 
fulfill this assumption in its choice of Ga and Sa) 

4- e s < (a s (M) — e s ) //3 2 7, where e s is the error given by the ranks approximation of M 
by GS. 

(The initial ranks approximation should be good enough) 
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Then, Am is non-singular. 



Proof. Lemma 5.5 yields \a s (M) — a s (GS)\ < \ \M — GS\\ 2 = e s , or 

as (M) - e s < a s (GS) . 
From assumptions [2] and [3] we obtain 

a s (GS) //3 2 7 < o- s (G) a s (S) /(3 2 < a s (G A ) a s (S A ) 



(19) 

Ga and Sa are s x s matrices. Assumptions [TJ [2] and [3] show that a s (G a) and cr s (G a) are 
non-zero. Thus, Ga and Sa are non-singular and we obtain 

1 1 1 



a s (G A ) cr s (S / 



Is 1 " 1 ! 



< 



I c — 1 1 

Pi 



By combining Eqs. (18), (19) and (J20J) we get 

(a s (M) - e.) //3 2 7 < o s (G A S A ) • 



a s (G A> S A ) . (20) 



(21) 



Am and GaS a are the top left s x s corners of M and GS, respectively. Hence, we can write 
\\Am — GaSa\\ 2 ^ \ \M — GS\\ 2 = e s . By combining this expression with Eq. (21 ) and using 
assumption |ij we have \ \A M — G a Sa\\ 2 ^ a s (GaSa)- Equivalently, 



\Am — GaS a \ 



< 



(22) 



The matrix GaSa is non-singular since it is the product of the non-singular matrices Ga 
and Sa- Equation 2.7.6 in [19] states that for any matrix A and perturbation matrix AA we 
have 



mm 

yl+AA singular 



l|AA|| 



^2 (A) • i ->■ • ^'"o"«" .. i-||2 

This equation in effect gauges the minimal L 2 distance from A to a singular matrix. By 
setting GaSa = A in Eq. (22) we conclude that A M is non-singular. □ 

Assumption [2] can be verified for different types of rank-s approximations of M. For the 



approximated SVD we have Corollary 5.7 



Corollary 5.7. When the approximated SVD is used to form GS, we have 7 = 1 (where 7 
is defined by assumption^ in Theorem 5.6). 

Proof. Let M ~ UY,V T be the approximated SVD of M. We can choose G = UH and 
S = V T . From the properties of the SVD, we have a s (G) = a s (WE) = E ss = a s (GS) and 
a s (S) = 1. It follows that a s (G) a s (S) = a s (GS). □ 
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Similarly, the (3 in assumption [3] depends on the algorithm that is used to pick Ga and 
Sa from within G and S, respectively. When a state-of-the-art RRQR algorithm is used, we 



derive Corollary 5.8 



Corollary 5.8. When the RRQR version given in Algorithm 1 in J2b] is used to choose 
Ga and Sa, we have (3 < a/ s (min (m, n) — s) + 1 , where (3 is defined by assumption^ in 
Theorem \5.6\ 



Proof. Let A E 



snxt 



be a matrix where k < n and a let A 



A x A 2 



be a partition of A 

where Ai G M> hxk . The concept of local /i-maximum volume was used in to find a pivoting 
scheme such that a m i n (A\) is bounded from below. Formally, Lemma 3.5 in [26] states that 
when A\ is a local /x-maximum volume in A, we have cr m j n (Ai) > af. (A) / yfk (n — k) /i 2 + 1. 
H is a user-controlled parameter that has negligible effect in this bound. For instance, [26] 
suggests setting ji — 1 + u, where u is the machine precision. Therefore, we omit fi in 
subsequent references of this bound. 

Algorithm 1 in [26] describes how a local /^-maximum volume can be found for a given 
matrix A. This algorithm can be applied to the choice of Ga and S A from the rows of G and 
S T , respectively. It follows from Lemma 3.5 in [2S] that a s (Ga) > o~ s (G) / a/s (m — s) + 1 
and a s (Sa) = cr s (^a) — a s (S T ) I a/s (n — s) + 1 = a s (S) / a/s (n — s) + 1. The definition 
of /3 yields the required expression. □ 

Later the RRQR algorithm will be used to select G T A and Sa as columns from G T and S, 
respectively. This is equivalent to choosing rows from G and S T . The latter form was used 
for compatibility with the notation of [26J. 

Theorem |5.6 states that if our rank-s approximation of M is sufficiently accurate and our 
RRQR algorithm managed to pick s non-singular columns from G T and S, then our sample 
matrix Am is non-singular. 

We bring a few definitions in order to bound the error of the Nystrom approximation proce- 
dure. We will decompose the matrix M into a sum of two matrices: M\ g that contains the en- 
ergy of the top s singular values and M sm that contains the residual. If M ig and M sm are given 
in SVD outer product form, then we have M ig = YH=i a i u i v i and M sm = Yli^s+l °~i u i v i, 
respectively. Based on this decomposition, we define the following decompositions of Mi g 



and M=. 



M = Mi g + M. 



A M 


Bm 




Alg 


Big 


+ 


A R 

^sra J -'sm 


Fm 


Cm 




_ F lg 


Clg _ 


Fsm C sm 



(23) 
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Lemma 5.9. If all the assumptions of Theorem 5.6 hold and if we have 



, „ a s (M) - e s 
a s+1 (M) < ' - e s 

Pi 



(24) 



(where e s is defined by assumption^in Theorem 5.6), then Ai g is non-singular. 



Proof. We employ Lemma 5.5 to bound |<r s (Am) — o~ s (G A S A )\. Formally, we have 



Ws (A M )-a s (G A S A )\ < \\A M -G A S A \\ 2 < \\M - G s \\ 2 = e s . 
By rearranging terms, we obtain a s (G A S A ) — e s < a s (Am)- Combining this expression with 



Eq. (21) from the proof of Theorem 5.6 yields 



<Js (M) - e s 

(3 2 1 



-e s <a s (A M ) 



(25) 



The quantity \\A M — A[ g \\ 2 can be bounded by \\A M — Ai g \\ 2 < \\M — Mi g \\ 2 = a s+1 (M). 



Combining the above with Eqs. (24) and (25) yields 



\A M -A lg \\ 2 < a s+1 (M) < 6s -e s <a s (A M , 



The terms are rearranged to get 



1-4 



M 



(26) 



where k is the standard L2-norm condition number. This expression is similar to Eq. (22) 



in the proof of Theorem 5.6 As before, if Am is non-singular, then Eq. (26) implies that 
Ai g is non-singular. □ 

We define the rank-s approximation of M that is based on the truncated SVD form of M\ g . 
Let M tg = U S J: S V S T be the truncated SVD of M. Denote X = U s J: s and Y = V? such that 



Mi 



hi 



XY. We define X 



X A X 



B 



and Y 



Y A Y, 



B 



where X A , Y A e 



'. We 



get the following forms for the components of Mi„: Ai„ = X A Y A , Bi q = X A Y B , F lg = X B Y A 



and C, 



hi 



XbYb- 



The Nystrom approximation error can now be formulated. 

Lemma 5.10. Assume that Am and Ai g are non-singular. Then, the error of the Nystrom 
approximation procedure is bounded by 



a s+1 (M) / a x (Mf 



os (A M ) \ ct s (Ai g ) 



+ 2a x (M) + a s+1 (M) . 



(27) 
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Proof. As seen from Eq. ( 13 ), the matrices Am, Bm and Fm are not modified by the Nystrom 
extension. Cm is approximated as FmA m B. Assuming that A is non-singular, then FmA m B 
is equivalent to FmA^B. The latter can be decomposed using the partitioning in Eq. (23): 



FmA m B — (F^ + F sm ) A M (Big + B sm ) — 

— Fi g A~ B^ + Fi g A~ x B sm + F sm A~ l Bi g + FsmA^Bg 



(28) 



Since Am and Ai g are non-singular, we have A M — A lg = —A lg (A — Ai g ) A M = —A lg A sm A M . 
The first term of Eq. (28) can be written as 

Fig A' 1 B ig = F^ (Ajg 1 - A^AnnAjf) B^ = F^ Al^ Big - FigA^ A sm A M x Big. (29) 

By our assumption, the matrices Xa and Ya are non-singular since A\ g = XaYa is non- 
singular. The first term of Eq. ( 29 ) becomes: 



FigA7 l Bi g = X B Y A (XaYaY 1 X a Y b = X b YaY^X- a x X a Y b = X B Y B = Q 



This means that Fi g A lg x Big is the best rank-s approximation to Cm, as given by the truncated 
SVD of M. We can bound the error by collecting all the other terms in Eqs. (28) and (29): 

F n y S = FigA^g A sm Aj^ Big + FigA B sm + F sm A Big + F sm A B sm . 



By the definition of M sm in Eq. (23), we have ||M sm || 2 < a s+1 (M). Therefore, we can 
bound | \A sm \ | 2 , | \B am \ | 2 and ||F sm || 2 by a s+ i(M). Similarly, ||-B/ 9 || 2 and [ | 2 are bounded 
by o~\ (M). The overall bound on ||£' n3/s || 2 is 



\E 



nys 1 1 2 



~FigA lg 1 A sm A M 1 Bi g + Fig A l B sm + F sm A 1 Bi a + F sm A 1 B, 



hi 



ig 



FigA-ig A sm A M Bi g \ | + | |-F} 5 -A 1 -B sm | | 2 + | \F sm A l Big\\ 2 -\-\\F sm A l B 



'sm 1 1 2 



< 
2 — 

< 



ai{M) 2 a s+1 (M) . gi (M)cr B+1 (M) , ai (M)cr s+1 (A/) . a s+1 (M) 2 
cr a (A M )a s (A lg ) <J S (A M ) <r s {A M ) <r s (A M ) 



Ts + l(M) / (71 (M) 



<rs{A M ) \a s (A lg ) 



2^ (M) + a s+l (M) 



□ 



Corollary |5.11| is derived straightforwardly: 

Corollary 5.11. If A M is non-singular and the matrix M is ranks, then, the Nystrom 
extension approximates M perfectly. 



Proof. If M is rank-s then Ai g = Am and the conditions in Lemma 5.10 hold. We obtain 
the result by setting a s+1 (M) = in Eq. 027b. □ 
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We proceed to express the Nystrom approximation error in relation to the parameters j3, 7 
and e s , as defined by the assumptions in Theorem 5.6 



Theorem 5.12. Assume that the assumptions of Theorem 5.6 hold as well as the assump- 
tions of Lemmas 5.S\ and 5.10 , The error term of the Nystrom procedure is bounded by: 



a s+1 (M) /3 2 7 



<ri (M) 2 /3 2 7 



a s (M) - (1 + /3 2 7 ) e s \ a s (M) - (1 + /3 2 7 ) e s - a s+1 (M) /3 2 7 



2a 1 (M) + a s+l (M) 



(30) 



Proof. We use Lemma 5.5 to obtain: 



\a s {A M )-a s (Aig)\ < \ \A M - A lg \\ 2 < \\M - M lg \\ 2 = a s+1 (M) . 
Equivalently, a s (Am) — c s+ i (M) < a s (Ai g ). We substitute a s (Am) with the left side of Eq 

as (M) - e 



(25) to get 



P 2 1 



-e s - a s+1 (M) < a s (A lg ) . (31) 

The result follows when the expressions for a s (A M ) and a s (A [g ) in Eq. (27) are replaced 
with the left sides of Eqs. (25) and (31), respectively. □ 



When Am is non-singular, the eigengap in the s th singular value governs the approximation 
n be seen 
Theorem 



5.12 



error. This can be seen from Eq. (|30|), where the eigengap appears in the expression 



(r a (M)-(l+/3 2 7 )e s 

in the limit case when the eigengap is infinite. 



bounds the general case. Corollary 



5.11 



shows what happens 



6 Sample Selection Algorithm 

Our algorithm is based on Theorem 5J3 and Corollaries 5.7| and 5^ It receives as its input 
a matrix M G ]R mxn and a parameter s that determines the sample size. It returns Am - a 



"good" sub-sample of M. If the algorithm succeeds, we can use Theorem |5.12| to bound the 
approximation error. The algorithm is described in Algorithm 1. 
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Algorithm 1 (M, s) 



1. Form a rank-s decomposition of M. Formally M ~ GS, where G G 



and S G 



2. Apply the RRQR algorithm to G T to find a column pivoting matrix such that 
G T A G T B \= G t Eq = Q G R G , where G A G W xs and G B G M sxm -*. Let I s be the 
group of indices in M that correspond to the first s columns of Eg- 



3. Apply the RRQR algorithm to S to find a column pivoting matrix E$ such that 

S A S B ]= SE s = Q s Rs, where S A G R sxs and S B G M sxn - S . Let J s be the group 
of indices in M that correspond to the first s columns of E$- 

4. if rank (Ga) ^ s or rank (Sa) ^ s then 

return 'Algorithm failed. Please pick a different value for s." 
end if 



5. Form the matrix Am G 
sample matrix. 



such that ^4 a/ 



VHei B ,jeJ B 



Returns Am as the sub- 



6.1 Algorithm Complexity Analysis 

Step[l]is the computational bottleneck of the algorithm and can take up to O (min (mn 2 , nm 2 )) 
operations if full SVD is used. Approximate SVD algorithms are typically faster. For exam- 
ple, the algorithm in [21] runs in O (mn) time, which is linear in the number of elements in the 
matrix. If we have some prior knowledge about the structure of the matrix, it can take even 
less time. For example, if an approximation of the norms of the columns is known, we can use 
LinearTimeSvd [H] to achieve a sub-linear runtime complexity of O (s 2 m + s 3 ). We denote 
the runtime complexity of this step by T approx . Using the RRQR algorithm in [20j, steps [2] 
and [3] in Algorithm 1 take 0(ms 2 ) and 0{ns 2 ) operations, respectively. Finally, the forma- 
tion of A M takes 0(s 2 ) time. The total runtime complexity becomes O (T approx + (m + n) s 2 ) 
and it is usually dominated by O {T approx ). 

Denote the space requirements of step [T] in Algorithm 1 by S approx . Then, the total space 
complexity becomes O (S approx + s (m + n)). Typically, a total of O ((m + n) s°^) space is 
used. 
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6.2 Relation to ICD 



Let M be decomposed into M = X T X where X e IR nxn . In this case, the R factor in the QR 
decomposition of X is the Cholesky factor of M since X = QR means that M = X T X = 
R T Q T QR = R T R. Similarly, the Cholesky decomposition of a symmetrically pivoted M 
corresponds to a column pivoted QR of X. The pivoting strategy used by the Cholesky 
algorithm in the ICD algorithm is the greedy scheme of the classical pivoted-QR algorithm 
in [8]. Applying ICD to M gives the R factor of the pivoted QR on X, and vice versa. The 
special structure of the matrix enables the ICD to unite steps TJ2 and [3] in Algorithm 1, 
creating a rank-s approximation to M while at the same time choosing pivots according to 
a greedy QR criterion. This allows the ICD to achieve a runtime complexity of O (s 2 n). 



7 Experimental Results 

In our experiments, we employ a fast but inaccurate sub-linear SVD approximation for 
step [T] in Algorithm 1. This approximated SVD first randomly samples the columns of the 
matrix. Then, it uses these columns in the LinearTimeSVD algorithm of [H] to compute an 
SVD approximation in O (s 2 m + s 3 ) operations. For this SVD algorithm, the total runtime 
complexity of Algorithm 1 is O (s 2 (m + n) + s 3 ) which is dominated by O (s 2 (m + n)). 



7.1 Kernel Matrices 

First, we compare between the performance of Algorithm 1 and the state-of-the-art sample 
selection algorithms for kernel matrices. We construct a kernel matrix for a given dataset, 
then each algorithm is used to choose a fixed sized sample. From the notation of Eqs. (jl|) 
and ([7]) , the error is displayed as M — M 



The following algorithms were compared: 1. The ICD algorithm presented in section 5.1 



2. The /c-means based algorithm presented in section |5.1| 3. Random choice of sub-sample 
as given in [IT] : 4. LinearTimeSVD of [H]; 5. Algorithm 1; 6. SVD. The SVD algorithm 
is used as a benchmark, since it provides rank-s approximation with the lowest Frobenius 
norm error. The empirical gain of our procedure can be measured by the difference between 
the approximation errors of LinearTimeSV D and Algorithm 1, since LinearTimeSV D is 
used in Step [T] of Algorithm 1. 

We use a Gaussian kernel of the form k(x,y) = exp (— \ \x — y\\ 2 /e) where e is the average 
squared distance between data points and the means of each dataset. Results for methods 
which contain probabilistic components are presented as the averages over 20 trials. These 
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include methods 2, 3, 4 and 5. The sample size is gradually increased from 1% to 10% of 
the total data and the error is measured in terms of the Frobenius norm. The benchmark 
datasets, summarized in Table [TJ were taken from the LIBSVM archive [10] . The overall 
experimental parameters were chosen to allow for comparison with Fig. 1 in [33] . 

The results are presented in Fig. [TJ Algorithm 1 generally outperforms the random sample 
selection algorithm, particularly on datasets with fast spectrum decay such as german.numer, 
segment and svmguidel a. In these datasets, our algorithm approaches and sometimes even 
surpasses the state-of-the-art fc-means based algorithm of [33]. This fits our derivation for 
the approximation error given by Theorem |5.12[ 

It should be noted that the algorithm in [26] has a runtime complexity of O (sn) compared 
to our O (s 2 n) for this setting. This difference has no real-world consequences when s is very 
small or even constant, as typical for these problems. 

In some cases, Algorithm 1 actually performs worse than LinearTimeSVD. We use 
a greedy RRQR algorithm which sometimes does not properly sort the singular-vectors 
according to their importance (namely, the absolute value of the singular- value). This can 
happen for instance when the spectrum decays slowly, which means leading singular values 
are close in magnitude. In Algorithm 1, we always choose the top s indices as found by the 
RRQR algorithm, so we might get things wrong. 

dataset german.numer splice adultla dna segment wla svmgdla satimage 

sample count 1000 1000 1605 2000 2310 2477 3089 4435 

dimension 24 60 123 180 19 300 4 36 

Table 1: Summary of benchmark datasets (taken from [TO] ) 



7.2 General Matrices 



We evaluate the performance of Algorithm 1 on general matrices by comparing it to a random 
choice of sub-sample. We use the full SVD as a benchmark that theoretically achieves the 
best accuracy. The approximation error is measured by M — M 

2 

The testing matrices in this section were chosen to have non-random spectra with random 
singular subspaces. Initially, a non-random diagonal matrix L is chosen with non-increasing 
diagonal entries. L will serve as the spectrum of our testing matrix. Then, two random 
unitary matrices U and V are generated. Our testing matrix is formed by ULV T . We 
examine two degrees of spectrum decay: linear decay (slow) and exponential decay (fast). 
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The error is presented in L 2 norm and we vary the sample size to be between 1%-10% of 
the matrix size. The presented results are from an averaging of 20 iterations to reduce the 
statistical variability. For simplicity, we produce results only for 500 x 500 square matrices. 

The results are presented in Fig. |2j When the spectrum decays slowly, Algorithm 1 has 
no advantage over random sample selection. It produces overall pretty bad results. But the 
situation is much different in the presence of a fast spectrum decay. Algorithm 1 displays 
good results when the sample size allows it to capture most of the significant singular values 
of the data (at a sample rate of about 3%). It is interesting to note that random sample 
selection does not lag far behind. This hints that, on average, any sample is a good sample 
as long as it captures more data than the numeric rank of the matrix. 



7.3 Non-Singularity of Sample Matrix 

We empirically examine the relationship between the Nystrom approximation error and the 
non-singularity of the sub-sample matrix. The approximation error is measured in L 2 -norm 
and the non-singularity of Am € W xs is measured by the magnitude of a s (Am)- We employ 
testing matrices similar to those in section ^2 These feature a non-random spectrum and 
random singular subspaces. The sample was chosen to be 5% of the data of the matrix. 
In this test, we compare between the random sample selection algorithm and Algorithm 
1. Each algorithm ran 100 times on each matrix. The results of each run were recorded. 
Figure [3] features a log-log scale plot of the approximation error as a function of a s (Am)- The 
performance of the different algorithm versions is compared. We arrive at similar conclusions 



to those in section 7.2 Our algorithms do no better than random sampling when the 
spectrum decay is slow, but consistently outperforms the random selection in the presence of 
fast spectrum decay. Figure [3] also shows a strong negative correlation between the variables 
in all the examined matrices. Hence, a large a s (Am) implies a small approximation error. 
The linear shape of the graphs, drawn in a log-log scale, suggests that this relationship is 
exponential. The results hint at a possible extension of the Nystrom procedure to a Monte- 
Carlo method: Algorithm 1 can be run many times. In the end, we choose the sample for 
which <j s (Am) is maximal. 



8 Conclusion and Future Research 

In this paper, we showed how the Nystrom approximation method can be used to find the 
canonical SVD and EVD of a general matrix. In addition, we developed a sample selection 
algorithm that operates on general matrices. Experiments have been performed on real- world 
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kernels and random general matrices. These show that the algorithm performs well when the 
spectrum of the matrix decays quickly and the sample is sufficiently large to capture most 
of the energy of the matrix (the number of non-zero singular values). Another experiment 
showed that the non-singularity of the sample matrix (as measured by the magnitude of the 
smallest singular value) is exponentially inversely related to the approximation error. This 
shows that our theoretical reasoning in Lemma 5.10 is qualitatively on par with empirical 
evidence. 

Future research should focus on additional formalization of the relationship between the 
smallest singular value of the sample matrix and the Nystrom approximation error. Another 
interesting possibility is to find a constrained class of matrices and develop a sample selection 
algorithm to take advantage of the constraint. Some classes of matrices may be easier to 
sub-sample with respect to the Nystrom method. 
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Figure 1: Nystrom approximation errors for kernel matrices. The X-axis is the sampling 
ratio given as sample size divided by the matrix size. The Y-axis is the approximation error 
given in Frobenius norm. The tested algorithms are: random, LinearTimeSVD, Algorithm 



1, ICD, k- means and SVD 
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Figure 2: Nystrom approximation errors for random matrices. The X-axis is the sampling 
ratio given as sample size divided by the matrix size. The Y-axis is the approximation error 
given in L2 norm. The tested algorithms are Random, LinearTimeSVD, Algorithm 1 and 
SVD. 
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Figure 3: Errors in Nystrom approximation as a function of a s (Am) 
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